2.5 2D distributions

2.5.1 Rectangular binning in plotly.js

The plotly package provides two functions for displaying rectangular bins: add_heatmap() and add_histogram2d(). For numeric data, the add_heatmap() function is a 2D analog of add_bars() (bins must be pre-computed), and the add_histogram2d() function is a 2D analog of add_histogram() (bins can be computed in the browser). Thus, I recommend add_histogram2d() for exploratory purposes, since you don’t have to think about how to perform binning. It also provides a useful zsmooth attribute for effectively increasing the number of bins (currently, “best” performs a bi-linear interpolation, a type of nearest neighbors algorithm), and nbinsx/nbinsy attributes to set the number of bins in the x and/or y directions. Figure ?? compares three different uses of add_histogram(): (1) plotly.js’ default binning algorithm, (2) the default plus smoothing, (3) setting the number of bins in the x and y directions.

p <- plot_ly(diamonds, x = ~log(carat), y = ~log(price))
subplot(
  add_histogram2d(p) %>%
    colorbar(title = "default", len = 1/3, y = 1) %>%
    layout(xaxis = list(title = "default")),
  add_histogram2d(p, zsmooth = "best") %>%
    colorbar(title = "zsmooth", len = 1/3, y = 2/3 - 0.05) %>%
    layout(xaxis = list(title = "zsmooth")),
  add_histogram2d(p, nbinsx = 60, nbinsy = 60) %>%
    colorbar(title = "nbins", len = 1/3, y = 1/3 - 0.1) %>%
    layout(xaxis = list(title = "nbins")),
  shareY = TRUE, titleX = TRUE
)

2.5.2 Rectangular binning in R

In Bars & histograms, we leveraged a number of algorithms for computing the “optimal” number of bins in 1D by routing the results of hist() into add_bars(). There is a surprising lack of academic research and computational tools for the 2D analog. Among the research that does exist, solutions usually depend on characteristics of the unknown underlying distribution, so a sensible approach is to assume a Gaussian form (David W. Scott 1992). Practically speaking, that assumption is not very useful, but thankfully kernel density estimation, and specifically the kde2d() function from the MASS package provides a well-supported rule-of-thumb.

kde_count <- function(x, y, ...) {
  kde <- MASS::kde2d(x, y, ...)
  df <- with(kde, setNames(expand.grid(x, y), c("x", "y")))
  df$count <- with(kde, c(z) * length(x) * diff(x)[1] * diff(y)[1])
  data.frame(df)
}

kd <- with(diamonds, kde_count(log(carat), log(price), n = 30))
plot_ly(kd, x = ~x, y = ~y, z = ~count) %>% 
  add_heatmap() %>%
  colorbar(title = "Number of diamonds")

2.5.3 Basic heatmaps

The add_heatmap() function can also handle categorical x and y variables.

corr <- cor(diamonds[vapply(diamonds, is.numeric, logical(1))])
plot_ly(x = rownames(corr), y = colnames(corr), z = corr) %>%
  add_heatmap() %>%
  colorbar(limits = c(-1, 1))

A heatmap with categorical axes is basically a two-way table. The Titanic dataset is a four-way table that we can put into a tidy data frame using the tidy() function from the broom package.

broom::tidy(Titanic) %>%
  plot_ly(x = ~interaction(Age, Survived), y = ~interaction(Class, Sex)) %>%
  add_heatmap(z = ~Freq)


#one_heatmap <- function(d) {
#  plot_ly(d, x = ~Class, y = ~Sex) %>% 
#    add_heatmap(z = ~Freq) %>%
#    add_annotations(
#      ~unique(paste(Age, Survived, sep = ", ")), x = 0.5, y = 1, 
#      yanchor = "bottom", xanchor = "middle",
#      xref = "paper", yref = "paper", showarrow = FALSE
#    )
#}
#
#tidy(Titanic) %>%
#  group_by(Age, Survived) %>%
#  do(p = one_heatmap(.)) %>%
#  subplot(
#    nrows = 2, margin = c(0.01, 0.01, 0.04, 0.04),
#    shareX = TRUE, shareY = TRUE, 
#    titleY = FALSE, titleX = FALSE
#  )

2.5.4 Contours

plot_ly(diamonds, x = ~cut, y = ~clarity) %>% 
  add_histogram2dcontour()